Task: Data Analysis

For this task we will use Andy Warhol’s iconic portrait: Liza Minnelli

library(imager)

im <- imager::load.image("liza_minnelli_andy_warhol_collection.jpg")
plot(im)

This picture is stored as a ‘cimg’ object, which is bascially a 4-dimensional array. The first index is the horizontal pixel, the second is the vertical, the third is the opacity, and the fourth is the colour (R,G,B)

And we can turn this into something useful for clustering by using the ‘as.data.frame’ method with the option wide = “c” (This option only works for a ‘cimg’ object.) We then rename the three colours to “R”, “G”, and “B” using ‘rename’ function from ‘dplyr’.

library(dplyr)

tidy_data <- as.data.frame(im, wide = "c") %>% 
  rename(R = c.1, G = c.2, B = c.3)

head(tidy_data, 5)
##   x y         R         G         B
## 1 1 1 0.7960784 0.5647059 0.5411765
## 2 2 1 0.5882353 0.3019608 0.2745098
## 3 3 1 0.6470588 0.2784314 0.2470588
## 4 4 1 0.6941176 0.2588235 0.2235294
## 5 5 1 0.7098039 0.2509804 0.2156863

Because ‘class’ has type ‘cimg’ (type ‘class(im)’ to confirm), when we call ‘as.data.frame’ R finds the version of ‘as.data.frame’ that works on on that type of object. In this case it finds the internal function ‘imager:::as.data.frame.cimg()’, which has the ‘wide = “c”’ argument. (The three :s means that the function is internal to the package.)

We now have the data in the foremat required to the clustering. Explore various k-means clustering using the template laid out in the Learning K-Means with tidy data principles vignette.

First things first, let’s make the scree plot.

library(purrr)
library(tidymodels)

dat <- select(tidy_data, c(-x, -y))

kclusts <- tibble(k = c(2:10)) %>% 
  mutate(kclust = map(k, ~kmeans(x = dat, centers = .x, nstart = 4)),
         glanced = map(kclust, glance))

clusterings <- kclusts %>% 
  unnest(cols = c(glanced))

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Clusters", y = "Within groups sum of squares")

Maybe 6 is the right number of clusters? It’s hard to tell. Hence, we will try the ratio version.

nclust = length(clusterings$k)
ratio = rep(NA, nclust-1)

for (kk in 2:nclust) {
  ratio[kk-1] = clusterings$tot.withinss[kk]/clusterings$tot.withinss[kk-1]
}

plot_data <- data.frame(k = clusterings$k[2:nclust], ratio)

ggplot(plot_data, aes(x = k, y = ratio)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of Clusters", y = "Within groups sum of squares")

From this the number of clusters seems to be six! So let’s use that going forward.

First, let’s re-do the clustering and save the centres.

k <- 7
kclust <- kmeans(select(tidy_data, -x, -y), centers = k, nstart = 20)
centres <- tidy(kclust)

We can also add a column to the tidied centres to add the colour in a way that we can use for plots. The ‘rgb’ function will do this and display the colour as a hex string.

centres <- centres %>% 
  mutate(col = rgb(R, G, B))

centres
## # A tibble: 7 x 7
##       R     G     B   size withinss cluster col    
##   <dbl> <dbl> <dbl>  <int>    <dbl> <fct>   <chr>  
## 1 0.567 0.500 0.493   7551    142.  1       #91807E
## 2 0.827 0.698 0.641  84195    134.  2       #D3B2A3
## 3 0.251 0.203 0.315  20316    234.  3       #403450
## 4 0.861 0.733 0.680 186232    158.  4       #DCBBAD
## 5 0.738 0.266 0.270 491573    249.  5       #BC4445
## 6 0.107 0.102 0.110 248942    196.  6       #1B1A1C
## 7 0.711 0.130 0.135   9767     24.9 7       #B52122

It’s probably worth seeing what the colours are. In this case, we will use ‘show_col’ from ‘scales’.

library(scales)
show_col(centres$col)

Visually, we can see that two of these colours are skin tones. Let’s see what happens if we choose 6 colours.

kclust6 <- kmeans(select(tidy_data, -x, -y), centers = 6, nstart = 20)

centres6 <- tidy(kclust6)

centres6 <- centres6 %>% 
  mutate(col = rgb(R, G, B))

show_col(centres6$col)

It’s slightly different but probably better. This is one of those cases where the scree plot can be misleading and using visualizations can help.

So now we have six clusters we need to put the do the cluster centre replacement. To do this, we first need to augment the initial data with the clusters. We can do this with ‘broom::augment’ function (‘broom’ is a package loaded by ‘tidymodels’). The ‘rename’ command just makes the naming a little nicer.

tidy_data <- augment(kclust6, tidy_data) %>% 
  rename(cluster = .cluster)

glimpse(tidy_data)
## Rows: 1,048,576
## Columns: 6
## $ x       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ y       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ R       <dbl> 0.7960784, 0.5882353, 0.6470588, 0.6941176, 0.7098039, 0.7176…
## $ G       <dbl> 0.5647059, 0.3019608, 0.2784314, 0.2588235, 0.2509804, 0.2588…
## $ B       <dbl> 0.5411765, 0.2745098, 0.2470588, 0.2235294, 0.2156863, 0.2313…
## $ cluster <fct> 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…

We can now plot the clustered picture!

ggplot(tidy_data, aes(x = x, y = y, fill = cluster)) +
  geom_tile() +
  scale_discrete_manual(aesthetics = "fill", values = centres6$col)

We can see that Liza is upside down.

ggplot(tidy_data, aes(x = x, y = y, fill = cluster)) +
  geom_tile() +
  scale_discrete_manual(aesthetics = "fill", values = centres6$col) +
  scale_y_reverse() +
  theme_void()